Abstract

Human activities comprise several sub-activities performed in a sequence and involve interactions
with various objects. This makes reasoning about the object affordances a central task for activity
recognition. In this work, we consider the problem of jointly labeling the object affordances and
human activities from RGB-D videos. We frame the problem as a Markov Random Field where the nodes
represent objects and sub-activities, and the edges represent the relationships between object
affordances, their relations with sub-activities, and their evolution over time. We formulate the
learning problem using a structural SVM approach, where labelings over various alternate temporal
segmentations are considered as latent variables. We tested our method on a dataset comprising 120
activity videos collected from four subjects, and obtained an end-to-end precision of 81.8% and
recall of 80.0% for labeling the activities.

1 Introduction

In this paper, we present a learning algorithm that takes as input an RGB-D video (obtained from
an inexpensive sensor such as Microsoft Kinect), and identifies the human activities taking place
over long time periods (e.g., see Fig. 2). Most prior work in human activity detection has focused
on activity detection from still images or from 2D videos. Estimating the human pose is the primary
focus of these works, and they consider activities taking place over shorter time scales (see
Section 2). Having access to a 3D camera, which provides RGB-D videos, enables us to robustly
estimate human poses and use this information for learning complex human activities.

Our focus in this work is to recognize complex human activities that take place over long time scales 
and that consist of a long sequence of sub-activities, such as making cereal and arranging objects in 
a room. For example, the making cereal activity consists of around 12 sub-activities on average,
including reaching the pitcher, moving the pitcher to the bowl, and then pouring the milk into the
bowl. This proves to be a very challenging task given the variability across individuals in
performing each sub-activity, and other environment-induced conditions such as cluttered background
and viewpoint changes. (See Fig. 1 for some examples.)

In most previous works, object detection and activity recognition have been addressed as separate 
tasks. Only recently, some works have shown that modeling mutual context is beneficial QUI]. 
The key idea in our work is to note that, in activity detection, it is sometimes more informative to 
know how an object is being used (associated affordances, [3]) rather than knowing what the object 
is (i.e., the object category). For example, both chair and sofa might be categorized as 'sittable,' 
and a cup might be categorized as both 'drinkable' and 'pourable.' Note that the affordances of an 
object change over time depending on its use, e.g., a pitcher may first be reachable, then movable 
and finally pourable. In addition to helping activity recognition, recognizing object affordances is 
important by itself because of their use in robotic applications (e.g., [4]).

We propose a method to learn human activities by modeling the sub-activities and affordances of 
the objects, how they change over time, and how they relate to each other. More formally, we define 
a Markov Random Field over two kinds of nodes: object and sub-activity nodes. The edges in the 
graph model the pairwise relations among interacting nodes, namely the object-object interactions, 
Figure 1: Example shots of reaching sub-activity from our dataset. Top row shows the RGB image, and the
bottom row shows the corresponding depth images from the RGB-D camera. Note that there are significant 
variations in the way the subjects perform the sub-activity. In addition, there is significant background clutter 
and subjects are partially occluded (e.g., column 1) or not facing the camera (e.g., column 4) in many instances. 

object-sub-activity interaction, and the temporal interactions. This model is built with each spatio- 
temporal segment being a node. The parameters of this model are learnt using a structural SVM 
formulation [5]. Given a new sequence of frames, we label the high-level activity, all the sub- 
activities and the object affordances using our learned model. 

However, the activities take place over a long time-scale, and different people execute sub-activities 
differently and for different periods of time. Furthermore, people also often merge two consecutive 
sub-activities together. Thus, segmentations in time are noisy and in fact, there may not be one 
'correct' segmentation, especially at the boundaries. One approach could be to consider all possible 
segmentations, and marginalize the segmentation; however, this is computationally infeasible. In 
this work, we perform sampling of several segmentations, and consider labelings over these temporal 
segments as latent variables in our learning algorithm. 

In extensive experiments over 120 activity videos collected from four subjects, we showed that our 
approach outperforms the baselines in both the tasks of activity as well as affordance detection. We 
achieved an accuracy of 91.8% for affordance, 86.0% for sub-activity labeling and 84.7% for high- 
level activities respectively when given the ground truth segmentation, and an end-to-end accuracy 
of 83.9%, 68.2% and 80.6% on these respective tasks. 

2 Related Work 

There has been a lot of work on human activity detection from images |@, 0] and from videos 
O. Here, we discuss works that are closely related to ours, and refer
the reader to [17] for a survey of the field. Most works (e.g., lHHEEl]) consider detecting actions
at a 'sub-activity' level (e.g., walk, bend, and draw) instead of considering high-level activities. 
Their methods range from discriminative learning techniques for joint segmentation and recognition 
djltlJ] to combining multiple models [13]. Some works such as [14] consider high-level activities.
Tang et al. [14] propose a latent model for high-level activity classification that has the
advantage of requiring only high-level activity labels for learning. None of these methods
explicitly considers the role of objects or object affordances, which not only help in identifying
sub-activities and high-level activities, but are also important for several robotic applications
(e.g., [4]).

Some recent works fl 0, [Hi [Hi \T$\ show that modeling the interaction between human poses and 
objects in 2D videos results in a better performance on the tasks of object detection and activity 
recognition. However, these works cannot capture the rich 3D relations between the activities and 
objects, and are also fundamentally limited by the quality of the human pose inferred from the 2D 
data. More importantly, for activity recognition, the object affordance matters more than its category. 

Kjellstrom et al. [20] used a Factorial CRF to simultaneously segment and classify human hand
actions, as well as classify the object affordances involved in the activity from 2D videos. However,
this work is limited to classifying only hand actions and does not model interactions between the
objects. We consider complex full-body activities and show that modeling object-object interactions
is important, as objects have affordances even if human hands do not interact with them directly.

Recently, with the availability of inexpensive RGB-D sensors, some works consider labeling and
object recognition in 3D point-clouds [21, 22]. Sung et al. [23] considered activity recognition
from RGB-D videos. They propose a hierarchical maximum entropy Markov model to detect activities
from RGB-D videos and treat the sub-activities as hidden nodes in their model. However, they use
only human pose information for detecting activities and also constrain the number of sub-activities
in each activity. In contrast, we model context from object interactions along with human pose, and
also present a better learning algorithm. (See Section 4 for further comparisons.) Gall et al. [24]
also use depth data to perform sub-activity (referred to as action) classification and functional
categorization of objects. Their method first detects the sub-activity being performed using the
estimated human pose from depth data, and then performs object localization and clustering of the
objects into functional categories based on the detected sub-activity. In contrast, our proposed
method performs joint sub-activity and affordance labeling and uses these labels to perform
high-level activity detection.

Figure 2: Pictorial representation of the different types of nodes and relationships modeled in part of the
cleaning objects activity comprising three sub-activities: reaching, opening and scrubbing.

All of the above works lack a unified framework for combining all the information available in human
interaction activities. We therefore propose a model that captures both the spatial and temporal
relations between objects and human poses to perform joint object affordance and activity detection.

3 Our Approach 

Our goal is to perform joint activity and object affordance labeling of RGB-D videos. As illustrated
in Fig. 2, we define a Markov Random Field (MRF) over the spatio-temporal sequence we get from
an RGB-D video. The MRF is represented as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. There are two types of nodes in
$\mathcal{G}$: object nodes denoted by $\mathcal{V}_o$ and sub-activity nodes denoted by $\mathcal{V}_a$.

3.1 Representation 

If we build our graph with nodes for objects and sub-activities for each time instant (at 30 fps),
we end up with a very large graph. Furthermore, such a graph would not be able to model
meaningful transitions between the sub-activities because they take place over a long time (e.g., a
few seconds). Therefore, in our approach we first segment the video into small temporal segments,
and our goal is to label each segment with appropriate labels. We try to over-segment, so that we
end up with more segments and avoid merging two sub-activities into one segment. Each of these
segments occupies a small length of time and therefore, considering nodes per segment gives us a
meaningful and concise representation for the graph $\mathcal{G}$. With such a representation, we can model
meaningful transitions of a sub-activity following another, e.g., pouring followed by moving.
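
As a minimal sketch of the simplest way to obtain such temporal segments, the snippet below cuts a video into consecutive fixed-length chunks. The function name `uniform_segments` and the example frame count are illustrative assumptions; the 20-frame length is the uniform segment size mentioned in Section 4, and the segmentation methods actually used may differ.

```python
def uniform_segments(num_frames, segment_length=20):
    """Split frame indices 0..num_frames-1 into consecutive fixed-length segments.

    Returns a list of (start, end) pairs with `end` exclusive; the final
    segment may be shorter than `segment_length`.
    """
    if num_frames < 0 or segment_length <= 0:
        raise ValueError("num_frames must be >= 0 and segment_length > 0")
    return [(start, min(start + segment_length, num_frames))
            for start in range(0, num_frames, segment_length)]


# Hypothetical 95-frame clip (about 3 seconds at 30 fps).
segments = uniform_segments(num_frames=95, segment_length=20)
print(segments)  # [(0, 20), (20, 40), (40, 60), (60, 80), (80, 95)]
```

Each returned pair would correspond to one temporal segment, i.e., one sub-activity node plus one node per tracked object.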

3.2 Overview of Properties Captured 

Over the course of a video, a human may interact with several objects and perform several
sub-activities over time. In our MRF model (in Section 3.3), these interactions are represented by the
edges $\mathcal{E}$, and we try to capture the following properties:

Affordance - sub-activity relations. At any given time, the affordance of the object depends on the
sub-activity it is involved in. For example, a cup has the affordance of 'pourable' in a pouring
sub-activity and has the affordance of 'drinkable' in a drinking sub-activity. We compute relative
geometric features between the object and the human's skeletal joints to capture this.

Affordance - affordance relations. Objects have affordances even if the human does not interact
with them directly, and their affordances depend on the affordances of other objects around them.
E.g., in the case of pouring from a pitcher to a cup, the human does not interact with the cup
directly, but the cup has the affordance 'pour-to'. We therefore use relative geometric features
such as "on top of", "nearby", "in front of", etc., to model the affordance - affordance relations.

Sub-activity change over time. Each activity consists of a sequence of sub-activities that change
over the course of performing the activity. We model this by incorporating temporal edges in $\mathcal{G}$.

Affordance change over time. The object affordances depend on the activity being performed and 
hence change along with the sub-activity over time. We model the temporal change in affordances 
of each object using features such as change in appearance and location of the object over time. 

3.3 Model 

We model the spatio-temporal structure of an activity using a model isomorphic to a Markov Random
Field with log-linear node and pairwise edge potentials (see Fig. 2 for an illustration). Let $K_a$
denote the set of sub-activity labels, and $K_o$ denote the set of object affordance labels. Given a
temporally segmented 3D video $x = (x_1, \ldots, x_N)$ consisting of temporal segments $x_s$, we aim to
predict a labeling $y = (y_1, \ldots, y_N)$, with one $y_s$ for each segment. For a segmented 3D video $x$,
the prediction $\hat{y}$ is computed as the argmax of a discriminant function $f_w(x, y)$ that is
parameterized by weights $w$.

$$\hat{y} = \operatorname*{argmax}_{y} \; f_w(x, y) \tag{1}$$
The discriminant function captures the dependencies between the sub-activity and object affordance
labels as defined by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. We now describe the structure of this graph.
For object nodes denoted by $\mathcal{V}_o$ and sub-activity nodes denoted by $\mathcal{V}_a$, let $\mathcal{V}_o^s$ denote the set of
object nodes of segment $s$, and $v_a^s$ denote the sub-activity node of segment $s$. For all segments $s$,
there is an edge connecting all the nodes in $\mathcal{V}_o^s$ to each other and to the sub-activity node $v_a^s$.
These edges signify the relationships within the objects, and between the objects and the human pose
within a segment, and are referred to as 'object - object interactions' and 'sub-activity - object
interactions' in Fig. 2, respectively.

The sub-activity node of segment $s$ is connected to the sub-activity nodes in segments $(s-1)$ and
$(s+1)$. Similarly, every object node of segment $s$ is connected to the corresponding object nodes
in segments $(s-1)$ and $(s+1)$. These edges model the temporal interactions between the human
poses and the objects, respectively, and are represented by dotted edges in Fig. 2.

We categorize the edges into different types denoted by $\mathcal{T}$. The various types are object - object
edges, object - sub-activity edges and sub-activity - sub-activity edges. Let $\mathcal{E}_t$ denote the edges
of type $t \in \mathcal{T}$ and $T_t$ denote pairs of possible labels for the nodes connected by edges of type $t$.
For example, if the edge type $t$ is object - sub-activity, then $(l, k) \in T_t$, $\forall l \in K_o$, $\forall k \in K_a$.

Let $y_i^k$ be a binary variable representing the node $i$ having label $k$, where $k \in K_o$ for object nodes
and $k \in K_a$ for sub-activity nodes. All $k$ binary variables together represent the label of a node in
the above described graph $\mathcal{G}$. A segment label $y_s$ is composed of two types of node labels: a set of
object affordance labels, $\{y_i^k : k \in K_o, \forall i \in \mathcal{V}_o^s\}$, and a sub-activity label $\{y_i^k : k \in K_a, i = v_a^s\}$.
Given $\mathcal{G}$, we define the following discriminant function based on individual node features $\phi_a(i)$ and
$\phi_o(i)$ and edge features $\phi_t(i, j)$ as further described below.

$$f_w(x, y) = \sum_{i \in \mathcal{V}_a} \sum_{k \in K_a} y_i^k \left[ w_a^k \cdot \phi_a(i) \right] + \sum_{i \in \mathcal{V}_o} \sum_{k \in K_o} y_i^k \left[ w_o^k \cdot \phi_o(i) \right] + \sum_{t \in \mathcal{T}} \sum_{(i,j) \in \mathcal{E}_t} \sum_{(l,k) \in T_t} y_i^l y_j^k \left[ w_t^{lk} \cdot \phi_t(i, j) \right] \tag{2}$$

For the node feature maps, $\phi_a(i)$ and $\phi_o(i)$, there is one weight vector for each of the sub-activity
classes in $K_a$ and each of the object affordance classes in $K_o$, respectively. There are multiple types
$t$ of the edge feature maps $\phi_t(i, j)$, each corresponding to a different type of edge in the graph. For
each type, there is one weight vector each for every pair of labels the edge can take.
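
To make the structure of the discriminant function concrete, the following is a minimal sketch that evaluates it for one complete labeling of a toy graph. All names (`score_labeling`, the node ids, the edge-type string) and all numbers are illustrative assumptions, not values from the experiments; because each node carries exactly one label, the sums over labels reduce to a lookup of the chosen label's weight vector.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("dimension mismatch")
    return sum(a * b for a, b in zip(u, v))


def score_labeling(labels, node_feats, node_weights, edges, edge_feats, edge_weights):
    """Evaluate the pairwise log-linear score for one complete labeling.

    labels:       node id -> chosen label (the k with y_i^k = 1)
    node_feats:   node id -> feature vector phi(i)
    node_weights: label -> weight vector w^k
    edges:        list of (edge_type, i, j)
    edge_feats:   (edge_type, i, j) -> feature vector phi_t(i, j)
    edge_weights: (edge_type, l, k) -> weight vector w_t^{lk}
    """
    total = 0.0
    for i, phi in node_feats.items():
        total += dot(node_weights[labels[i]], phi)
    for (t, i, j) in edges:
        total += dot(edge_weights[(t, labels[i], labels[j])], edge_feats[(t, i, j)])
    return total


# Toy graph: one sub-activity node "a1" and one object node "o1" (made-up numbers).
node_feats = {"a1": [1.0, 0.0], "o1": [0.5, 2.0, 1.0]}
node_weights = {
    "pouring": [2.0, 1.0], "reaching": [0.0, 1.0],              # sub-activity labels
    "pourable": [1.0, 1.0, 0.0], "reachable": [0.0, 0.0, 1.0],  # affordance labels
}
edges = [("object-sub-activity", "o1", "a1")]
edge_feats = {("object-sub-activity", "o1", "a1"): [1.0]}
edge_weights = {
    ("object-sub-activity", "pourable", "pouring"): [3.0],
    ("object-sub-activity", "pourable", "reaching"): [-1.0],
    ("object-sub-activity", "reachable", "pouring"): [-1.0],
    ("object-sub-activity", "reachable", "reaching"): [3.0],
}

consistent = score_labeling({"a1": "pouring", "o1": "pourable"},
                            node_feats, node_weights, edges, edge_feats, edge_weights)
mixed = score_labeling({"a1": "reaching", "o1": "pourable"},
                       node_feats, node_weights, edges, edge_feats, edge_weights)
print(consistent, mixed)  # 7.5 1.5
```

In this toy setting the edge weights reward the compatible pair (pourable, pouring), which is the kind of mutual context the edge terms are meant to capture.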

Features. For a given object node $i$, the node feature map $\phi_o(i)$ is a vector of features representing
the object's location in the scene and how it changes within the temporal segment. These features
include the (x, y, z) coordinates of the object's centroid and the coordinates of the object's bounding
box at the middle frame of the temporal segment. We also run a SIFT feature based object tracker
[25] to find the corresponding points between the adjacent frames and then compute the transformation
matrix based on the matched image points. We add the transformation matrix corresponding to
the object in the middle frame with respect to its previous frame to the features in order to capture
the object's motion information. In addition to the above features, we also compute the total
displacement and the total distance moved by the object's centroid in the set of frames belonging to the
temporal segment. We then perform cumulative binning of the feature values into 10 bins. In our
experiments, we have $\phi_o(i) \in \mathbb{R}^{180}$.
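
The cumulative binning step can be sketched as follows. This is one plausible reading, stated as an assumption: each scalar feature is mapped to 10 binary indicators, where indicator b is switched on once the value reaches the b-th threshold. The function names and the evenly spaced thresholds are hypothetical; the thresholds actually used are not given in the text.

```python
def cumulative_bins(value, thresholds):
    """Map one scalar to len(thresholds) binary indicators.

    Indicator b is 1 when value >= thresholds[b], so larger values switch on
    a growing prefix of the indicators (thresholds must be sorted ascending).
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    return [1 if value >= t else 0 for t in thresholds]


def bin_feature_vector(values, thresholds_per_feature):
    """Concatenate the cumulative indicators of every raw feature."""
    binned = []
    for value, thresholds in zip(values, thresholds_per_feature):
        binned.extend(cumulative_bins(value, thresholds))
    return binned


# Hypothetical thresholds: ten evenly spaced cut points for every feature.
thresholds = list(range(0, 100, 10))        # 0, 10, ..., 90
print(cumulative_bins(35, thresholds))      # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(len(bin_feature_vector([35, 72, 5], [thresholds] * 3)))  # 30
```

Under this reading, a 180-dimensional object node feature vector would correspond to 18 raw scalar features with 10 indicators each; that breakdown is inferred from the stated dimension rather than taken from the text.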

Similarly, for a given sub-activity node $i$, the node feature map $\phi_a(i)$ gives a vector of features
computed using the human skeleton information obtained from running OpenNI's skeleton tracker
[26] on the RGB-D video. We compute the features described above for each of the upper-skeleton joint
(neck, torso, left shoulder, left elbow, left palm, right shoulder, right elbow and right palm) locations
relative to the subject's head location, thus giving us $\phi_a(i) \in \mathbb{R}^{1030}$.

The edge feature maps $\phi_t(i, j)$ describe the relationship between nodes $i$ and $j$. For capturing the
object-object relations within a temporal segment, we compute relative geometric features such as
the difference in (x, y, z) coordinates of the object centroids and the distance between them. These
features are computed at the first, middle and last frames of the temporal segment along with the min
and max of their values across all frames in the temporal segment to capture the relative motion
information. This gives us $\phi_1(i, j) \in \mathbb{R}^{200}$. Similarly, for the object-sub-activity relation features
$\phi_2(i, j) \in \mathbb{R}^{400}$, we use the same features as for the object-object relation features, but we compute
them between the upper-skeleton joints and each object's centroid. The temporal relational
features capture the change across temporal segments, and we use the vertical change in position and
the distance between the corresponding objects and between the corresponding joint locations. This
gives us $\phi_3(i, j) \in \mathbb{R}^{40}$ and $\phi_4(i, j) \in \mathbb{R}^{160}$, respectively.
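
As a rough illustration of the object-object relation features, the sketch below computes the centroid differences and the distance per frame, then keeps the values at the first, middle and last frames together with the per-feature min and max over the segment. The function name and the toy tracks are assumptions for illustration.

```python
import math


def relative_features(track_a, track_b):
    """Relative geometric features between two objects within one segment.

    track_a, track_b: lists of (x, y, z) centroids, one entry per frame.
    Returns 20 values: (dx, dy, dz, distance) at the first, middle and last
    frame, followed by the per-feature minimum and maximum over all frames.
    """
    if not track_a or len(track_a) != len(track_b):
        raise ValueError("tracks must be non-empty and of equal length")
    per_frame = []
    for (ax, ay, az), (bx, by, bz) in zip(track_a, track_b):
        dx, dy, dz = ax - bx, ay - by, az - bz
        per_frame.append((dx, dy, dz, math.sqrt(dx * dx + dy * dy + dz * dz)))
    first, middle, last = per_frame[0], per_frame[len(per_frame) // 2], per_frame[-1]
    mins = tuple(min(column) for column in zip(*per_frame))
    maxs = tuple(max(column) for column in zip(*per_frame))
    return list(first + middle + last + mins + maxs)


# Toy tracks: object A moves along x while object B stays at the origin.
track_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
track_b = [(0, 0, 0), (0, 0, 0), (0, 0, 0)]
feats = relative_features(track_a, track_b)
print(len(feats))    # 20
print(feats[4:8])    # [1, 0, 0, 1.0]  (middle frame)
```

These 20 raw values, each expanded into 10 cumulative bins, would give 200 dimensions, which is consistent with the stated size of the object-object feature map; the exact raw feature list remains an assumption.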
Inference. Given the model parameters $w$, the inference problem is to find the best labeling $\hat{y}$ for
a new video $x$, i.e., solving the argmax in Eq. (1) for the discriminant function in Eq. (2). This is
an NP-hard problem. However, its equivalent formulation as the following mixed-integer program
has a linear relaxation which can be solved efficiently as a quadratic pseudo-Boolean optimization
problem using a graph-cut method [27].

$$\hat{y} = \operatorname*{argmax}_{y} \max_{z} \; \sum_{i \in \mathcal{V}_a} \sum_{k \in K_a} y_i^k \left[ w_a^k \cdot \phi_a(i) \right] + \sum_{i \in \mathcal{V}_o} \sum_{k \in K_o} y_i^k \left[ w_o^k \cdot \phi_o(i) \right] + \sum_{t \in \mathcal{T}} \sum_{(i,j) \in \mathcal{E}_t} \sum_{(l,k) \in T_t} z_{ij}^{lk} \left[ w_t^{lk} \cdot \phi_t(i, j) \right] \tag{3}$$

$$\forall i, j, l, k : \quad z_{ij}^{lk} \le y_i^l, \quad z_{ij}^{lk} \le y_j^k, \quad y_i^l + y_j^k \le z_{ij}^{lk} + 1, \quad z_{ij}^{lk}, y_i^l \in \{0, 1\} \tag{4}$$
Note that the products $y_i^l y_j^k$ have been replaced by auxiliary variables $z_{ij}^{lk}$. Relaxing the variables
$z_{ij}^{lk}$ and $y_i^l$ to the interval [0, 1] results in a linear program that can be shown to always have
half-integral solutions (i.e., $y_i^l$ only take values {0, 0.5, 1} at the solution) [28]. Since every node in our
experiments has exactly one class label, we also consider the linear relaxation from above with the
additional constraints $\forall i \in \mathcal{V}_a : \sum_{k \in K_a} y_i^k = 1$ and $\forall i \in \mathcal{V}_o : \sum_{k \in K_o} y_i^k = 1$. This problem can
no longer be solved via graph cuts. We compute the exact mixed-integer solution including these
additional constraints using a general-purpose MIP solver¹ during inference. The MIP solver takes
10.7 seconds on average for one video (a typical video has a graph with 17 sub-activity nodes
and 592 object nodes, i.e., 6090 variables).
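
The role of the auxiliary variables can be checked by brute force. The short sketch below (the helper name `feasible_z` is ours) enumerates all binary assignments and confirms that the three linear constraints of the mixed-integer program leave exactly one feasible value for the auxiliary variable, namely the product of the two label indicators.

```python
from itertools import product


def feasible_z(y_i, y_j):
    """Binary z values satisfying z <= y_i, z <= y_j and y_i + y_j <= z + 1."""
    return [z for z in (0, 1)
            if z <= y_i and z <= y_j and y_i + y_j <= z + 1]


for y_i, y_j in product((0, 1), repeat=2):
    # For binary inputs the constraints pin z to the product y_i * y_j.
    assert feasible_z(y_i, y_j) == [y_i * y_j]
    print(y_i, y_j, feasible_z(y_i, y_j))
```

This equivalence only holds for integral values; once the variables are relaxed to [0, 1], fractional solutions such as 0.5 become feasible, which is why the relaxed problem can return half-integral labelings.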

Learning. We take a large-margin approach to learning the parameter vector $w$ of Eq. (2) from
labeled training examples $(x_1, y_1), \ldots, (x_M, y_M)$ B221 [300 . Our method optimizes a regularized
upper bound on the training error $R(h) = \frac{1}{M} \sum_{m=1}^{M} \Delta(y_m, \hat{y}_m)$, where $\hat{y}_m$ is the optimal solution
of Eq. (1) and $\Delta(y, \hat{y}) = \sum_{i \in \mathcal{V}_o} \sum_{k \in K_o} |y_i^k - \hat{y}_i^k| + \sum_{i \in \mathcal{V}_a} \sum_{k \in K_a} |y_i^k - \hat{y}_i^k|$. To simplify
notation, note that Eq. (3) can be equivalently written as $w^T \Psi(x, y)$ by appropriately stacking the
$w_a^k$, $w_o^k$ and $w_t^{lk}$ into $w$ and the $y_i^k \phi_a(i)$, $y_i^k \phi_o(i)$ and $z_{ij}^{lk} \phi_t(i, j)$ into $\Psi(x, y)$, where each $z_{ij}^{lk}$ is
consistent with Eq. (4) given $y$. Training can then be formulated as the following convex quadratic
program Dill :

$$\min_{w, \xi} \; \frac{1}{2} w^T w + C \xi \tag{5}$$

$$\text{s.t.} \quad \forall \hat{y}_1, \ldots, \hat{y}_M \in \{0, 0.5, 1\}^{N \cdot K} : \quad \frac{1}{M} w^T \sum_{m=1}^{M} \left[ \Psi(x_m, y_m) - \Psi(x_m, \hat{y}_m) \right] \ge \Delta(y_m, \hat{y}_m) - \xi$$

While the number of constraints in this QP is exponential in $M$, $N$ and $K$, it can nevertheless be
solved efficiently using the cutting-plane algorithm [31]. The algorithm needs access to an efficient
method for computing

$$\hat{y}_m = \operatorname*{argmax}_{y \in \{0, 0.5, 1\}^{N \cdot K}} \left[ w^T \Psi(x_m, y) + \Delta(y_m, y) \right]. \tag{6}$$

Due to the structure of $\Delta(\cdot, \cdot)$, this problem is identical to the relaxed prediction problem in
Eqs. (3)-(4) and can be solved efficiently using graph cuts.
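
The loss used during learning sums the absolute differences between the true and predicted label indicators over all nodes and labels. A minimal sketch, assuming labelings are stored as one indicator vector per node (the function name and the toy node ids are illustrative):

```python
def labeling_loss(y_true, y_pred):
    """Sum of |y_i^k - yhat_i^k| over all nodes i and labels k.

    y_true, y_pred: node id -> list of indicator values, one per label.
    Entries may be 0, 0.5 or 1, so relaxed (half-integral) predictions work too.
    """
    if y_true.keys() != y_pred.keys():
        raise ValueError("both labelings must cover the same nodes")
    loss = 0.0
    for node, truth in y_true.items():
        pred = y_pred[node]
        if len(truth) != len(pred):
            raise ValueError("indicator vectors must have equal length")
        loss += sum(abs(a - b) for a, b in zip(truth, pred))
    return loss


y_true = {"a1": [1, 0, 0], "o1": [0, 1]}
exact = {"a1": [1, 0, 0], "o1": [0, 1]}
wrong = {"a1": [0, 1, 0], "o1": [0, 1]}        # one node mislabeled
relaxed = {"a1": [0.5, 0.5, 0], "o1": [0, 1]}  # half-integral solution
print(labeling_loss(y_true, exact), labeling_loss(y_true, wrong),
      labeling_loss(y_true, relaxed))
# 0.0 2.0 1.0
```

Because the loss decomposes over nodes and labels, adding it to the score only changes the node terms, which is why the loss-augmented problem has the same form as the relaxed prediction problem.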

3.4 Multiple Segmentations 

Segmenting an RGB-D video in time can be noisy, and multiple segmentations may be valid. Therefore,
we perform multiple segmentations by using different methods and criteria of segmentation
(see Section 4 for details). Thus, we get a set $\mathcal{H}$ of multiple segmentations, and let $h_n$ be the
$n$-th segmentation. A discriminant function $f_{w^{h_n}}(x^{h_n}, y^{h_n})$ can now be defined for each $h_n$ as in
Eq. (2). We now define a score function $g_{\theta_n}(y^{h_n}, y)$ which gives a score for assigning the labels of
the segments from $y^{h_n}$ to $y$,

$$g_{\theta_n}(y^{h_n}, y) = \sum_{k \in K} \sum_{i \in \mathcal{V}} \theta_n^k \, y_i^{h_n k} \, y_i^k \tag{7}$$

where $K = K_o \cup K_a$. Here, $\theta_n^k$ can be interpreted as the confidence of labeling the segments of
label $k$ correctly in the $n$-th segmentation hypothesis. We want to find the labeling that maximizes
the assignment score across all the segmentations. Therefore we can write inference in terms of a
joint objective function as follows

$$\hat{y} = \operatorname*{argmax}_{y} \max_{y^{h_n} \, \forall h_n \in \mathcal{H}} \; \sum_{h_n \in \mathcal{H}} \left[ f_{w^{h_n}}(x^{h_n}, y^{h_n}) + g_{\theta_n}(y^{h_n}, y) \right] \tag{8}$$



1 http://www.tfinley.net/software/pyglpk/readme.html
Table 1: Results on Cornell 60 Activity Dataset [23], tested on "New Person" data for 12 activity classes.

method           | bathroom    | bedroom     | kitchen     | living room | office      | Average
                 | prec  rec   | prec  rec   | prec  rec   | prec  rec   | prec  rec   | prec  rec
Sung et al. [23] | 72.7  65.0  | 76.1  59.2  | 64.4  47.9  | 52.6  45.7  | 73.8  59.8  | 67.9  55.5
Our method       | 88.9  61.1  | 73.0  66.7  | 96.4  85.4  | 69.2  68.7  | 76.7  75.0  | 80.8  71.4


This formulation is equivalent to considering the labelings $y^{h_n}$ over the segmentations as unobserved
variables. It is possible to use the latent structural SVM [32] to solve this, but it becomes intractable
if the size of the segmentation hypothesis space is large. Therefore we propose an approximate
two-step learning procedure to address this. For a given set of segmentations $\mathcal{H}$, we first learn the
parameters $w^{h_n}$ independently as described in Section 3.3. We then train the parameters $\theta$ on a
separate held-out training dataset. This can now be formulated as a QP:

$$\min_{\theta} \; \frac{1}{2} \theta^T \theta - \sum_{h_n \in \mathcal{H}} g_{\theta_n}(y^{h_n}, y) \quad \text{s.t.} \quad \forall k \in K : \sum_{n=1}^{|\mathcal{H}|} \theta_n^k = 1$$

Using the fact that the objective function defined in Eq. (5) is convex, we design an iterative
two-step procedure where we solve for $y^{h_n}, \forall h_n \in \mathcal{H}$ in parallel and then solve for $y$. This method
is guaranteed to converge. Since the number of variables scales linearly with the number of
segmentation hypotheses considered, solving the original problem in Eq. (8) directly becomes
considerably slow, whereas our method still scales. More formally, we iterate between the following
two problems:

$$\hat{y}^{h_n} = \operatorname*{argmax}_{y^{h_n}} \; f_{w^{h_n}}(x^{h_n}, y^{h_n}) + g_{\theta_n}(y^{h_n}, \hat{y}) \tag{9}$$

$$\hat{y} = \operatorname*{argmax}_{y} \; \sum_{h_n \in \mathcal{H}} g_{\theta_n}(\hat{y}^{h_n}, y) \tag{10}$$
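
The second step of this iteration has a simple interpretation when every node takes exactly one label: the fused label of a node is the one with the largest total confidence among the labels proposed by the individual segmentations. The sketch below illustrates only this fusion step under the assumption that all segmentations have been projected onto a common per-frame grid; the function name, the per-frame representation and the confidence values are hypothetical, and the first step (re-running inference for each segmentation) is not shown.

```python
def fuse_labelings(labelings, theta):
    """Fuse per-frame labels from several segmentations by weighted voting.

    labelings: one list of per-frame labels for each segmentation hypothesis.
    theta:     one dict per segmentation mapping label -> confidence.
    Returns, for each frame, the label with the largest summed confidence.
    """
    num_frames = len(labelings[0])
    if any(len(labels) != num_frames for labels in labelings):
        raise ValueError("all labelings must cover the same frames")
    fused = []
    for frame in range(num_frames):
        votes = {}
        for n, labels in enumerate(labelings):
            label = labels[frame]
            votes[label] = votes.get(label, 0.0) + theta[n][label]
        # Iterate labels in sorted order so ties are broken deterministically.
        fused.append(max(sorted(votes), key=votes.get))
    return fused


labelings = [
    ["reaching", "reaching", "moving", "moving"],
    ["reaching", "moving", "moving", "moving"],
    ["reaching", "reaching", "reaching", "moving"],
]
# Made-up confidences; for each label they sum to 1 across the segmentations.
theta = [
    {"reaching": 0.5, "moving": 0.2},
    {"reaching": 0.2, "moving": 0.5},
    {"reaching": 0.3, "moving": 0.3},
]
print(fuse_labelings(labelings, theta))
# ['reaching', 'reaching', 'moving', 'moving']
```

In the second frame, for example, "reaching" collects 0.5 + 0.3 = 0.8 against 0.5 for "moving", so the fused labeling follows the two segmentations that agree.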

High-level Activity Classification. For classifying the high-level activity, we compute the his- 
tograms of sub-activity and affordance labels and use them as features. Since the occlusion of 
objects also plays a major role in some activities, we capture this by including in our feature vector, 
the fraction of objects that are occluded fully or partially in the temporal segments. We then train a 
multi-class SVM classifier on training data using these features. 
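
A minimal sketch of how such a feature vector could be assembled is given below. The label vocabularies are the sub-activity and affordance labels listed in Section 4; the function names, the normalization of the histograms and the boolean occlusion flags are assumptions made for illustration.

```python
from collections import Counter

SUB_ACTIVITIES = ["reaching", "moving", "pouring", "eating", "drinking",
                  "opening", "placing", "closing", "scrubbing", "null"]
AFFORDANCES = ["reachable", "movable", "pourable", "pourto", "containable", "drinkable",
               "openable", "placeable", "closable", "scrubbable", "scrubber", "stationary"]


def normalized_histogram(labels, vocabulary):
    """Fraction of entries in `labels` equal to each vocabulary item."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[k] / total if total else 0.0 for k in vocabulary]


def activity_features(sub_activity_labels, affordance_labels, occluded_flags):
    """Sub-activity histogram + affordance histogram + fraction of occluded objects."""
    features = normalized_histogram(sub_activity_labels, SUB_ACTIVITIES)
    features += normalized_histogram(affordance_labels, AFFORDANCES)
    occluded = sum(1 for flag in occluded_flags if flag)
    features.append(occluded / len(occluded_flags) if occluded_flags else 0.0)
    return features


features = activity_features(
    ["reaching", "moving", "pouring", "moving"],
    ["reachable", "movable", "pourable", "pourto", "movable", "stationary"],
    [False, True, False, False],
)
print(len(features), features[1], features[-1])  # 23 0.5 0.25
```

The resulting fixed-length vector (10 + 12 + 1 entries here) can then be passed to any multi-class classifier.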

4 Experiments 

Data. We test our model on two 3D activity datasets: the Cornell 60 [23] and Cornell 120 Activity
Datasets. The Cornell 60 Activity Dataset [23] contains 60 3D videos of four different subjects
performing 12 high-level activity classes. However, some of these activity classes contain only
one sub-activity (e.g., working on a computer, cooking (stirring), etc.) and do not contain object
interactions (e.g., talking on couch, relaxing on couch).

We introduce the Cornell 120 Activity Dataset, which contains activity sequences of ten different
high-level activities performed by four different subjects, where each high-level activity was
performed three times. We thus have 61,585 total 3D video frames in our dataset. The high-level
activities are: {making cereal, taking medicine, stacking objects, unstacking objects, microwaving
food, picking objects, cleaning objects, taking food, arranging objects, having a meal}. The subjects
were only given a high-level description of the task² and were asked to perform the activities
multiple times with different objects. For example, the stacking and unstacking activities were
performed with pizza boxes, plates and bowls. They performed the activities through a long sequence of
sub-activities, which varied significantly from subject to subject in terms of the length of the
sub-activities, the order of the sub-activities and the way they executed the task. The camera was
mounted so that the subject was in view (although the subject may not be facing the camera), but often
there were significant occlusions of the body parts. See Fig. 1 for some examples.

We labeled our dataset with the sub-activity and the object affordance labels. We refer to the set 
of contiguous frames spanning a sub-activity as a temporal segment. Specifically, our sub-activity 
labels are: {reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing, null} 
and our affordance labels are: {reachable, movable, pourable, pourto, containable, drinkable,
openable, placeable, closable, scrubbable, scrubber, stationary}. Table 1 in the supplementary material
shows the details on which sub-activities are present in each high-level activity. 

Preprocessing. Given the raw data containing the color and depth values for every pixel in the
video, we first tracked the human skeleton using OpenNI's skeleton tracker [26] to obtain the
locations of the various joints of the human skeleton. However, these values are not very accurate, as
OpenNI's skeleton tracker is only designed to track human skeletons in clutter-free environments
and without any occlusion of the body parts. In real-world human activity videos, some body parts
are often occluded and the interaction with the objects hinders accurate skeleton tracking. We show



2 For example, the instructions for making cereal were: 1) Place bowl on table, 2) Pour cereal, 3) Pour milk. 
For microwaving food, they were: 1) Open microwave door, 2) Place food inside, 3) Close microwave door. 
Table 2: Results on Cornell 120 Activity Dataset, showing average micro precision/recall, and average macro
precision and recall for affordance, sub-activities and high-level activities. Standard error is also reported.

 #  method                      | Object Affordance                  | Sub-activity                       | High-level Activity
                                | micro       macro       macro      | micro       macro       macro      | micro       macro       macro
                                | P/R         Prec.       Recall     | P/R         Prec.       Recall     | P/R         Prec.       Recall
 1  max class                   | 65.7 ± 1.0  65.7 ± 1.0   8.3 ± 0.0 | 29.2 ± 0.2  29.2 ± 0.2  10.0 ± 0.0 | 10.0 ± 0.0  10.0 ± 0.0  10.0 ± 0.0
 2  image only                  | 74.2 ± 0.7  15.9 ± 2.7  16.0 ± 2.5 | 56.2 ± 0.4  39.6 ± 0.5  41.0 ± 0.6 | 34.7 ± 2.9  24.2 ± 1.5  35.8 ± 2.2
 3  SVM multiclass              | 75.6 ± 1.8  40.6 ± 2.4  37.9 ± 2.0 | 58.0 ± 1.2  47.0 ± 0.6  41.6 ± 2.6 | 30.6 ± 3.5  27.4 ± 3.6  31.2 ± 3.7
 4  MEMM [23]                   |          -           -           - |          -           -           - | 26.4 ± 2.0  23.7 ± 1.0  23.7 ± 1.0
 5  object only                 | 86.9 ± 1.0  72.7 ± 3.8  63.1 ± 4.3 |          -           -           - | 59.7 ± 1.8  56.3 ± 2.2  58.3 ± 1.9
 6  sub-activity only           |          -           -           - | 71.9 ± 0.8  60.9 ± 2.2  51.9 ± 0.9 | 27.4 ± 5.2  31.8 ± 6.3  27.7 ± 5.3
 7  no temporal interactions    | 87.0 ± 0.8  79.8 ± 3.6  66.1 ± 1.5 | 76.0 ± 0.6  74.5 ± 3.5  66.7 ± 1.4 | 81.4 ± 1.3  83.2 ± 1.2  80.8 ± 1.4
 8  no object interactions      | 88.4 ± 0.9  75.5 ± 3.7  63.3 ± 3.4 | 85.3 ± 1.0  79.6 ± 2.4  74.6 ± 2.8 | 80.6 ± 2.6  81.9 ± 2.2  80.0 ± 2.6
 9  full model: groundtruth seg | 91.8 ± 0.4  90.4 ± 2.5  74.2 ± 3.1 | 86.0 ± 0.9  84.2 ± 1.3  76.9 ± 2.6 | 84.7 ± 2.4  85.3 ± 2.0  84.2 ± 2.5

Full model. End-to-end results, without assuming any ground-truth temporal segmentation is given.

10  full, 1 segment, (best)     | 83.1 ± 1.1  70.1 ± 2.3  63.9 ± 4.4 | 66.6 ± 0.7  62.0 ± 2.2  60.8 ± 4.5 | 77.5 ± 4.1  80.1 ± 3.9  76.7 ± 4.2
11  full, 1 segment, (averaged) | 81.3 ± 0.4  67.8 ± 1.1  60.0 ± 0.8 | 64.3 ± 0.7  63.8 ± 1.1  59.1 ± 0.5 | 79.0 ± 0.9  81.1 ± 0.8  78.3 ± 0.9
12  full, multi-seg learning    | 83.9 ± 1.5  75.9 ± 4.6  64.2 ± 4.0 | 68.2 ± 0.3  71.1 ± 1.9  62.2 ± 4.1 | 80.6 ± 1.1  81.8 ± 2.2  80.0 ± 1.2


Figure 3: Confusion matrix for affordance labeling (left), sub-activity labeling (middle) and high-level activity 
labeling (right) of the test RGB-D videos. 



that even with such noisy data, our method gets high accuracies by modeling the mutual context 
between the affordances and sub-activities. 

We then use SIFT feature matching [25], while enforcing depth consistency across the time frames,
to obtain reliable object tracks in the 3D video. In our current implementation, we need to
provide the bounding box of the objects involved in the activity in the first frame of the video. Note
that these are just bounding boxes and not object labels. In the future, such bounding boxes of objects
of interest can also be obtained in a preprocessing step by either running a set of object detectors
[33] or some other methods that use 3D information OEH]. The outputs from the skeleton and
object tracking along with the RGB-D videos are then used to generate the features.

Labeling results on the Cornell 60 Activity Dataset. Table 1 shows the precision and recall of the
high-level activities on the Cornell 60 Activity Dataset [23]. Following Sung et al.'s [23] experiments,
we considered the same five groups of activities based on their location, and learnt a separate model
for each location. To make it a fair comparison, we do not assume perfect segmentation of
sub-activities and do not use any object information. Therefore, we train our model with only
sub-activity nodes and consider segments of uniform size (20 frames per segment). We consider only a
subset of our features described in Section 3.3 that are possible to compute from the tracked human
skeleton and RGB-D data provided in this dataset. Table 1 shows that our model significantly outperforms
Sung et al.'s MEMM model even when using only the sub-activity nodes and a simple segmentation
algorithm. More detailed results are provided in the supplementary material.

Labeling results on Cornell 120 Activity Dataset. Table 2 shows the performance of various
models on object affordance, sub-activity and high-level activity labeling. These results are obtained 
using 4-fold cross-validation and averaging performance across the folds. Each fold constitutes the 
activities performed by one subject, therefore the model is trained on activities of three subjects and 
tested on a new subject. We report both the micro and macro averaged precision and recall over 
various classes along with standard error. Since our algorithm can only predict one label for each
segment, micro precision and recall are the same as the percentage of correctly classified segments.
Macro precision and recall are the averages of precision and recall respectively for all classes. 
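
To make the difference between the two averages concrete, the following sketch computes both from a list of true and predicted segment labels. The function name `micro_macro` is hypothetical, and treating a class with no predictions (or no true instances) as contributing 0 to the macro average is an assumption, not something stated in the paper.

```python
from collections import Counter


def micro_macro(y_true, y_pred):
    """Micro- and macro-averaged precision and recall for single-label data."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def ratio(num, den):
        # Assumption: undefined per-class scores count as 0.
        return num / den if den else 0.0

    total_tp = sum(tp.values())
    return {
        # Micro: pool counts over all classes before dividing.
        "micro_precision": ratio(total_tp, total_tp + sum(fp.values())),
        "micro_recall": ratio(total_tp, total_tp + sum(fn.values())),
        # Macro: compute the score per class, then average the scores.
        "macro_precision": sum(ratio(tp[c], tp[c] + fp[c]) for c in classes) / len(classes),
        "macro_recall": sum(ratio(tp[c], tp[c] + fn[c]) for c in classes) / len(classes),
    }
```

Because every wrong prediction adds exactly one false positive and one false negative, the pooled counts give micro precision equal to micro recall, and both equal the fraction of correctly labeled segments.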

Assuming ground-truth temporal segmentation is given, the results for our full model are shown in 
Table [2] on line 9, its variations on lines 5-8 and the baselines on lines 1-3. The results in lines 
10-12 correspond to the case when temporal segmentation is not assumed. In comparison to a basic 
SVM multiclass model PTll ( referred to as SVM multiclass when using all features and image only 
when using only image features), which is equivalent to only considering the nodes in our MRF 
without any edges, our model performs significantly better. We also compare with the high-level 
activity classification results obtained from the method presented in [23]. We ran their code on 
our dataset and obtained an accuracy of 26.4%, whereas our method gives an accuracy of 84.7% when 
ground truth segmentation is available and 80.6% otherwise. Figure [4] shows a sequence of images 
from the taking food activity along with the inferred labels. Figure [3] shows the confusion matrix for 
labeling affordances, sub-activities and high-level activities with our proposed method. We can see 
that there is a strong diagonal with a few errors such as scrubbing misclassified as placing, and 
picking objects misclassified as arranging objects. 

Figure 4: Output of our algorithm: Sequence of images from the taking food activity labeled with sub-activity 
and object affordance labels. Frame labels, in order (sub-activity / affordance): opening / openable (object 1); 
reaching / reachable (object 2); moving / movable (object 2); placing / placeable (object 2); 
reaching / reachable (object 1); closing / closable (object 1). 
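
A confusion matrix of the kind discussed above can be built from true and predicted labels as in the following sketch. The function names and the row-normalization step are illustrative assumptions; the paper does not specify how its matrices were normalized.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix where entry [i][j] is the number of items whose true
    label is labels[i] and whose predicted label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        counts[index[truth]][index[pred]] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its sum so the diagonal holds per-class recall.

    Rows with no true instances are left as zeros.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

A "strong diagonal" then means that the diagonal entries of the normalized matrix are close to 1, and an off-diagonal entry such as the one in row scrubbing, column placing measures how often scrubbing is mistaken for placing.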

We analyze our model to gain insight into which interactions provide useful information by compar- 
ing our full model to variants of our model. 

How important is object context for activity detection? We show the importance of object context 
for sub-activity labeling by learning a variant of our model without the object nodes (referred to as 
sub-activity only). With object context, the micro precision increased by 14.1% and both macro pre- 
cision and recall increased by around 23.3% over sub-activity only. Considering object information 
(affordance labels and occlusions) also improved the high-level activity accuracy three-fold. 

How important is activity context for affordance detection? We also show the importance of 
context from sub-activity for affordance detection by learning our model without the sub-activity 
nodes (referred to as object only). With sub-activity context, the micro precision increased by 4.9% 
and the macro precision and recall increased by 17.7% and 11.1% respectively for affordance labeling 
over object only. The relative gain is smaller than that obtained in sub-activity detection because 
the object only model still has object-object context, which helps in affordance detection. 

How important is object-object context for affordance detection? In order to study the effect 
of the object-object interactions for affordance detection, we learnt our model without the 
object-object edge potentials (referred to as no object interactions). We see a considerable improvement in 
affordance detection when the object interactions are modeled: the macro recall increased by 14.9% 
and the macro precision by about 10.9%. This shows that sometimes the context from the human 
activity alone is not sufficient to determine the affordance of an object. 

How important is temporal context? We also learn our model without the temporal edges (referred 
to as no temporal interactions). Modeling temporal interactions increased the micro precision by 
4.8% and 10.0% for affordances and sub-activities respectively and increased the micro precision 
for high-level activity by 3.3%. 

End-to-end Results. Given the RGB-D video, we obtain the final labeling using our method 
described in Section [3.4]. To generate the segmentation hypothesis set H we consider three different 
segmentation algorithms, and generate multiple segmentations by changing their parameters. We 
consider: 1) uniform segmentation parameterized by segment size, 2) graph based segmentation 
[36] with edge weights defined using the displacement of skeletal joints, and 3) graph based 
segmentation with edge weights defined using rate of change of skeletal joint locations. Lines 10-12 
of Table [2] show the results of the best performing segmentation, the average performance over all the 
segmentations considered, and our proposed method for combining the segmentations. We see that our 
method improves the performance over considering a single best performing segmentation: macro 
precision increased by 5.8% and 9.1% for affordance and sub-activity labeling respectively. 
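
The idea of generating several segmentation hypotheses by sweeping a parameter can be sketched as follows. This is not the graph based algorithm of [36]: as a deliberately simplified stand-in, it cuts the video wherever the total joint displacement between consecutive frames exceeds a threshold. All function names, the skeleton data layout, and the threshold parameter are assumptions made for illustration.

```python
import math


def joint_displacement(frame_a, frame_b):
    """Total Euclidean distance moved by all joints between two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b))


def threshold_segments(skeleton, threshold):
    """Cut wherever consecutive-frame displacement exceeds `threshold`.

    `skeleton` is a list of frames; each frame is a list of (x, y, z) joints.
    Returns (start, end) frame-index pairs with `end` exclusive.
    """
    if not skeleton:
        return []
    boundaries = [0]
    for t in range(1, len(skeleton)):
        if joint_displacement(skeleton[t - 1], skeleton[t]) > threshold:
            boundaries.append(t)
    boundaries.append(len(skeleton))
    return list(zip(boundaries[:-1], boundaries[1:]))


def hypothesis_set(skeleton, thresholds):
    """One candidate segmentation per parameter value, duplicates removed."""
    hypotheses = []
    for threshold in thresholds:
        segments = threshold_segments(skeleton, threshold)
        if segments not in hypotheses:
            hypotheses.append(segments)
    return hypotheses
```

Uniform segmentations of several sizes would be added to the same hypothesis set, and the labelings over all hypotheses are then combined as described in Section 3.4.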

5 Conclusion 

In this paper, we considered the task of jointly labeling human activities and object affordances from 
RGB-D videos. The activities we consider happen over a long time period, and comprise several 
sub-activities performed in a sequence. We formulated this problem as a Markov Random Field, 
and learned the parameters of the model using a structural SVM formulation. Our model also in- 
corporates the temporal segmentation problem by computing several segmentations and considering 
labeling over these segmentations as latent variables. In extensive experiments over a challenging 
dataset, we show that our method achieves an end-to-end precision of 81.8% and recall of 
80.0% for labeling the activities performed by a different subject than the ones in the training set. We 
also showed that it is important to model the different properties (object affordances, object-object 
interaction, temporal interactions, etc.) in order to achieve good performance. 